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Analysis  of  a  Parallelissd  Nonlinear 
Elliptic  Boundary  Valne  Problem  Solver 
with  Application  to  Reacting  Flows 
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and 

Mitchell  D.  Smooke 
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Abstract:  A  parallelized  finite  difference  code  based  on  Newton’s  method  for  systems  of  non¬ 
linear  elliptic  boundary  value  problems  in  two  dimensions  is  anadyzed  in  terms  of  computational 
complexity  and  parallel  efficiency.  An  approximate  cost  function  depending  on  15  dimensionless 
parameters  (including  discrete  problem  dimensions,  convergence  parameters,  and  machine  charac¬ 
teristics)  is  derived  for  algorithms  based  on  stripwise  and  boxwbe  decompositions  of  the  domain 
and  a  one-to-one  assignment  of  the  strip  or  box  subdomains  to  processors.  The  sensitivity  of  the 
cost  function  to  the  parameters  is  explored  in  regions  of  parameter  space  corresponding  to  model 
small-order  systems  with  inexpensive  function  evaluations  and  also  a  coupled  system  of  nineteen 
equations  with  very  expensive  function  evaluations  (a  reacting  flow  model  of  engineering  interest 
which  motivates  the  work).  The  algorithm  has  been  implemented  on  the  Intel  Hypercube,  and 
some  experimental  results  for  the  model  problems  with  stripwise  decompositions  are  presented  and 
compared  with  the  theory.  In  the  context  of  computational  combustion  problems,  multiprocessors 
of  either  message-passing  or  shared-memory  type  may  be  employed  with  stripwise  decompositions 
to  reedize  speedups  of  0(n),  where  n  is  mesh  resolution  in  one  direction,  for  reasonable  n.  To  realize 
speedups  of  0{n^),  the  total  number  of  mesh  points,  only  hypercubes  appear  attractive.  These 
results  must  be  qualified  by  haurdware  assumptions,  including  sufficient  local  memory  per  processor 
to  hold  all  of  the  data  defined  on  the  associated  subdomain,  and  selection  of  machine  parameters 
typical  of  presently  commercially  available  components. 


t  This  rssesrch  wm  supported  by  the  Nutionul  Aeronautics  and  Space  AdmiiustratioB  under  NASA 
Contract  No.  NASl-18107  and  AFOSR  8S-0189  wkik  the  first  author  was  in  resideBce  at  ICASE,  NASA 
Langley  Research  Center,  Hampton,  VA  3SM8. 
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1.  Introduction 

Systems  of  nonlinear  elliptic  boundary  value  problems  (BVPs)  constitute  an  important  class 
of  problems  in  large-scale  scientific  computation.  In  many  instances,  including  the  example  con¬ 
sidered  herein  of  the  detailed  modeling  of  steady  multidimensional  reacting  flows,  evaluation  of 
the  governing  equation  residuals  at  a  given  solution  iterate  constitutes  a  significant  expense,  so 
methods  which  make  efficient  use  of  the  function  evaluations  are  required.  For  such  problems, 
robust  variations  of  Newton’s  method  are  often  preferable  to  less  fully  coupled  iterative  methods  or 
associated  time-marching  methods  (see,  t.g.,  [12]),  if  memory  capacity  allows,  as  we  tacitly  assume 
it  does  throughout  thb  paper. 

In  contrast,  for  many  other  systems  of  nonlinear  elliptic  BVPs,  such  as  the  steady  incompress¬ 
ible  Navier-Stokes  problem,  the  cost  of  a  residual  function  evaluation  is  much  less  than  the  cost 
of  forming  and  solving  the  large  linear  systems  arising  at  each  stage  of  Newton’s  method,  so  that 
various  time  or  time-like  relaxation  methods  requiring  less  implicitness  have  traditionally  been  pref- 
ered.  Explicit  methods  generally  parallelize  very  efficiently,  and  might  thus  appear  to  be  superior 
to  more  implicit  methods  as  parallel  algorithms,  even  where  this  might  not  be  true  in  the  serial 
context.  We  demonstrate  through  complexity  estimates  that  there  is  a  wide  range  of  parameter 
space  over  which  Newton-based  methods  also  pairallelize  within  acceptable  ranges  of  computational 
efficiency,  and  that  this  region  includes  reacting  flow  problems  of  current  interest. 

The  computational  complexity  model  is  introduced  in  general  terms,  but  is  ultimately  made 
problem-specific  for  the  purpose  of  reducing  the  15  parameters  of  the  model  to  a  manageable  number 
of  interesting  ones.  Some  of  these  parameters  are  unavailable  or  impractical  to  obtain  from  theory 
alone,  and  are  supplied  empirically  from  actual  serial  runtimes.  Parallel  numerical  experiments  for 
realistic  reacting  flows  are  as  yet  impractical  on  the  hardware  conveniently  available  to  the  authors 
for  this  study,  but  results  on  model  problems  in  representative  ranges  for  the  parameters  n,  the 
resolution  of  the  grid,  and  p,  the  number  of  processors  employed,  support  various  testable  results 
of  the  complexity  analysis. 

The  organization  of  this  paper  is  as  follows.  In  section  2  we  sketch  in  general  terms  a  mod¬ 
ified  Newton  algorithm  for  the  solution  of  nonlinear  elliptic  boundary  value  problems  by  finite 
discretization  methods.  Section  3  describes  a  serial  implementation  of  thb  algorithm  which  has  re¬ 
cently  been  applied  successfully  to  the  computation  of  an  axbymmetric,  over- ventilated,  subsonic, 
laminar  methane-air  jet  diffusion  flame.  Parallel  implementation  issues  and  a  complexity  theory 
are  presented  in  section  4.  Section  5  contains  some  actual  f  'tormance  data  for  model  systems  ob¬ 
tained  on  the  Intel  Hypercube,  and  a  dbcussion  of  its  impli<'  -  Dns  for  modeling  realbtic  systems. 
We  conclude  in  section  6  with  some  recommendations  for  the  further  development  of  the  parallel 
computation  of  reacting  flows. 

2.  An  Algorithm  for  Nonlinear  Elliptic  Boundary  Value  Problems 

2.1.  Newton*!  Method  for  a  Generic  Nonlinear  System 

We  consider  in  thb  section  a  discretbed  elliptic  system  in  the  generic  form 

mx,»)»0,  (2.1) 


where  d  &  vector  of  unknowns,  x  ^<1  ^  ^  vectors  of  parameters,  and  F  b  a  vector-valued 
function  in  which  the  components  of  X>  ^  ^Qter  noniinearly.  For  the  sake  of  definiteness 
in  the  following  dbcussion,  we  assume  that  F  arises  from  a  low-order  finite  difference  dbcretbation 
on  a  tensor-product  grid  in  two  dimensions,  with  m  gridpoints  in  one  direction  and  n  in  the  other, 
and  that  there  are  r  unknowns  defined  at  each  gridpoint.  The  unknown  vector  b  then  comprised 
of  rmn  elements,  including  boundary  values,  conveniently  enumerated  by  triple  subscripts  as 
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for  k  =  1, and  F  is  likewise  comprised  of  rmn  scalar  equations, 
including  discrete  boundary  conditions.  The  vector  of  parameters  x  consists  of  a  gridfunctions 
which  are  not  necessarily  independent  of  the  unknowns  but  are  in  general  given  by  a  set  of 
explicit  algebraic  relationships  which  involve  only  unknowns  defined  at  a  single  point  of  the  form 

Xlij  —  Xti4>Ujf  •  •  1  ^ri>) 


for  /  =  1, . . . ,5;  t  =  1,. . . ,m;j  =  1,. . . ,n. 

The  reason  for  distinguishing  between  the  ^  and  x  gridfunctions  is  that  the  equations  for  the 
latter  contain  no  spatial  derivatives  and  what  would  otherwise  be  a  system  of  order  (r+s)mn  can  be 
condensed  algebraically  with  modest  effort  to  a  system  of  order  rmn.  The  parameters  v  are  globed 
parameters  which  are  independent  of  the  unknowns  As  an  illustration  of  this  partitioning  in  the 
modeling  of  reacting  flows,  the  local  mass  fraction  of  a  species  belongs  in  its  local  coefficient  of 
diffusion  in  the  gas  mixture  belongs  in  Xi  a^d  the  activation  energy  of  its  formation  in  a  particular 
elementary  reaction  belongs  in  r.  The  parameters  x  and  ir  will  be  suppressed  in  what  follows,  but 
their  presence  is  felt  through  the  terms  involving  s  in  the  complexity  model  in  section  4. 

The  system  (2.1)  may  be  solved  efficiently  by  a  daunped  modified  Newton  method  provided 
that  an  initial  iterate  sufficiently  close  to  the  solution  is  supplied.  The  damped  modified 
Newton  iteration  is  given  by 

^(*+1)  =  ^(*)  +  (2.2) 

where 

=  (jW)-iF(^(*)),  (2.3) 

where  the  matrix  is  an  approximation  to  the  actual  Jacobiam  matrix  evaluated  at  the 
iterate.  We  refer  to  as  the  k**  update.  When  =  i  and  for  all  k, 

a  pure  Newton  method  is  obtained. 

For  the  pure  Newton  method,  the  iteration  (2.2)  possesses  at  least  a  quadratic  rate  of  con* 
vergence  when  the  hypotheses  of  the  Kantorovich  theorem  aure  satisfied.  That  is,  —  ^*||  < 

cJJ^(*)  _  0*||2  for  some  constant  c.  For  the  modified  Newton  method  without  restarting  (i.e.,  with 
7'*^  =  fof  all  »»  >  0),  the  rate  of  convergence  is  not  guaramteed  to  be  better  tham  lineau’. 

Consequently,  there  is  a  traideoff  to  be  mauie  between  the  higher  cost  per  iteration  of  the  pure 
Newton  method  and  the  higher  iteration  count  of  a  modified  Newton  method.  Thb  tradeoff  has 
been  variously  resolved  in  the  literature.  We  follow  the  praM:tice  of  restairting  our  modified  Newton 
method  whenever  the  norm  of  the  current  update  divided  by  the  norm  of  the  first  update  which 
was  obtauned  with  the  Jatcobiam  in  current  use  fails  to  satisfy  a  bound  derived  from  the  Kamtorovich 
theorem.  Thb  bound  depends  only  on  the  index  of  the  current  update  relative  to  that  of  the  last 
restart  (for  detaub,  see  [13]). 

When  the  sbe  of  the  successive  modified  Newton  steps  b  decreasing  in  accord  with  the  Kam¬ 
torovich  theorem,  no  daunping  b  necessary  (A^^l  =  1).  Since  there  b  no  pratctical  meams  for  ensuring 
that  the  initiad  iterate  lies  within  the  domain  of  convergence  defined  by  the  theorem,  damping  b 
often  necessary  in  the  early  stages  to  prevent  divergence.  In  our  implementation,  we  daunp 
for  either  of  two  reasons.  First,  if  bounds  aie  specified  by  the  user  for  certadn  components  of 
a1*1  b  chosen  as  the  minimum  of  1  and  of  the  value  which  would  drive  amy  component  of  to 

its  bound.  Second,  if  the  projected  next  update  based  on  damping  with  the  provbional  A^^l, 

=  (7(*1)-*F(^1*>  +  A^tf^W), 

b  larger  than  the  current  update  then  A1*1  b  recursively  halved  until  thb  condition  no 

longer  holds,  or  until  A1*1  becomes  smadler  than  some  tolerance.  In  the  latter  case,  the  algorithm 
unsuccessfully  damps  to  death,  and  a  restart  with  a  better  initial  condition  b  recommended. 
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2.2.  A  Pgeado-transient  Continaation  Procednr* 

A  better  initial  condition  for  Newton’s  method  is  usually  obtainable  by  an  elementary  contin¬ 
uation  procedure,  namely  driving  a  pseudo-transient  form  of  (2.1), 

D^  +  n«  =  0,  (2.4) 


where  D  is  a  scaling  matrix,  part  of  the  way  towards  its  steady  state  [15].  If  (2.4)  is  implicitly 
time-differenced  with  the  backward  Euler  method,  using  time  step  At  from  a  given  initial  state  ^ 
a  new  system, 


l*(^)  =  D^^^-f-F(^)  =  0, 


(2.5) 


results,  which  possesses  the  Jacobian 


J  = 


_D  ££ 


With  an  appropriate  D  and  a  sufficiently  small  At,  (2.5)  may  in  principal  be  iterated  under  the 
successive  substitution  ^  ^  as  a  time-accurate  solution  of  a  physically  transient  problem.  This 

is  generally  infeasible  in  terms  of  execution  time,  and  possibly  unproductive  because  of  spatial 
truncation  errors  which  remain  finite  even  as  the  time  step  becomes  infinitesimal.  (The  form  of  the 
algorithm  we  describe  in  this  paper  leaves  out  consideration  of  spatial  grid  adaptivity.  This  feature 
is  important  in  combustion  problems  and  has  been  implemented  in  the  serial  version  of  the  code, 
but  not  yet  in  the  parallel  version,  where  load  balancing  considerations  are  necessitated.) 

For  a  succession  of  time  steps  which  may  be  larger  than  what  would  be  required  for  accurate 
resolution  of  the  physical  transient,  the  iteration  of  (2.5)  is  a  numerically  robust  procedure  for 
generating  an  initial  iterate  for  Newton’s  method.  Moreover,  due  to  the  close  relationship  of  (2.5) 
to  (2.1),  only  minor  modifications  of  the  steady-state  code  for  (2.1)  are  required  to  make  it  a 
transient/steady-state  hybrid.  Adaptive  control  of  the  size  of  the  time  steps  used  in  (2.5),  can 
enhance  the  efficiency  of  such  a  hybrid  algorithm.  One  useful  strategy  is  to  choose  At  based 
on  the  temporal  truncation  error  of  the  most  rapidly  varying  component  of  the  solution.  As  ^ 
approaches  and  At  becomes  large,  this  effects  a  smooth  transition  to  the  (infinite  time  step) 
steady-state  formulation.  Alternatively,  the  growth  of  At  beyond  a  certain  At^ax  can  be  used  to 
trigger  automatically  a  transition  to  the  steady-state  formulation. 


2.2.  Algorithmic  Implementation  of  the  Hybrid  Solver 

The  Appendix  contains  two  procedures,  named  SOLVER  amd  NEWTON,  which  represent 
an  implementation  of  the  ideas  of  thb  section.  SOLVER  operates  in  either  of  two  modes:  a 
direct  treatment  of  the  steady-state  problem,  or  the  steady-state  problem  preceded  by  one  or  more 
adaptive  time  steps.  It  has  five  arguments:  the  name  of  a  user-supplied  subprocedure  F  (simply 
passed  to  the  internal  procedure  NEWTON)  which  evaluates  the  transient  or  steady-state  residual, 
depending  upon  the  mode;  a  real  solution  vector  x  for  communicating  the  initial  iterate  upon 
invocation  and  the  converged  solution  upon  a  successful  return;  an  integer  ni,-,n«  for  setting  the 
mode  (if  num*  >  0,  time-stepping  is  attempted  for  up  to  nu^  adaptively  chosen  steps);  an  initial 
time  step  Aii,  which  is  ignored  if  name  =  0;  and  a  flag,  flagsoLVERy  to  indicate  the  state  of  the 
procedure  upon  return.  There  are  two  other  flags  in  SOLVER  which  function  as  logical  control 
parameters:  flagsEWTOff,  which  communicates  the  state  of  the  called  procedure  NEWTON  upon 
its  return,  and  flagjACOBiAN,  which  determines  the  frequency  at  which  the  Jacobian  is  updated. 
The  finest  level  of  control  over  the  Jacobian  update  frequency  lies  inside  of  NEWTON  itself,  but 
at  a  coarse  level  SOLVER  forces  a  refreshing  of  the  Jacobian  at  the  outset  of  a  calculation  in  either 
mode  and  also  after  an  unsuccessful  time  step.  There  are  also  two  real  control  parameters  internal  to 
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SOLVER:  a  minimum  time  step  Atmm  and  a  maximum  time  step  Atmax-  During  normal  execution, 
the  time  step  At  is  adjusted  upward  or  downward  according  to  the  progress  of  the  iterations  in 
a  subprocedure  which  may  be  problem-specific  amd  is  regarded  as  a  black  box  for  purposes  of 
this  algorithm.  Atmin  is  s  lower  bound  on  the  size  of  the  time  step  to  keep  the  computation 
out  of  inefficient  realms.  If  the  procedure  attempts  to  set  a  time  step  smaller  than  Atmin  it  is 
terminated  with  a  “damped  to  death  in  transient  mode”  message.  Recovery  from  thb  termination 
calls  for  a  new  initial  iterate  or  a  willingness  to  expend  lots  of  CPU  time,  as  indicated  by  a  new, 
lower,  input  value  of  AtmiVi*  ^  procedure  attempts  to  set  a  time  step  greater  tham  Atmas< 
direct  consideration  of  the  steady-state  problem  is  deemed  feasible,  and  the  mode  is  automatically 
switched.  The  steady-state  mode  is  also  switched  in  by  default  after  the  number  of  time  steps 
specified  by  niima  b  exhausted.  A  failure  during  the  steady-state  computation  b  indicated  with 
the  flag  “damped  to  death  in  steady-state  mode”. 

The  procedure  named  NEWTON  b  driven  by  procedure  SOLVER.  The  mode  dbtinctions  of 
SOLVER  are  invbible  to  NEWTON;  only  the  meaning  of  the  argument  F  changes  with  mode,  not 
the  way  F  b  handled  within  NEWTON  .  The  procedure  has  seven  arguments:  the  name  of  a  user- 
supplied  procedure  F  which  it  calb  to  evaluate  a  transient  or  steady-state  residual;  a  real  solution 
vector  X  for  communicating  the  initiad  iterate  for  the  steady  or  transient  F  upon  invocation  and  the 
converged  solution  upon  return;  an  integer  tttime,  a  real  solution  vector  from  the  previous  time  step 
and  a  current  time  step  Ait,  all  three  of  which  aure  simply  passed  through  NEWTON  to  F; 
a  flag  flagsEWTON  to  indicate  the  state  of  the  procedure  upon  return,  and  a  flag  flagjACOBlAS 
to  control  the  frequency  of  updating  of  the  Jacobian.  The  internal  parameter  m  counts  the  number 
of  iterations  since  the  current  Jacobian  was  last  updated  and  serves  as  an  index  into  a  table  of 
convergence  bounds  from  [13],  denoted  arm  below.  There  aure  three  internal  control  parameters,  as 
follows.  The  parameter  nmoUfied  b  the  maximum  number  of  iterations  between  Jatcobian  updates 
even  if  the  modified  iterations  aure  making  their  theoretically  amticipated  progress.  The  paurameter 
Amin  b  a  lower  bound  on  the  size  of  the  adaptively  chosen  damping  pairameter,  which  b  designed 
to  abort  a  calculation  started  outside  of  the  domain  of  convergence.  Finally,  ^eonvtr$t  ia  an  absolute 
convergence  toleramce,  chosen  with  due  consideration  to  the  scaling  of  the  components  of  x  and 
to  the  machine  precbion.  Note  that  convergence  of  NEWTON  b  based  on  the  (theoretically 
motivated)  norm  of  the  update,  not  on  the  norm  of  the  residual  itself.  The  size  of  the  residual  at 
convergence  depends  on  the  condition  number  of  the  Jacobian  matrix,  and  may  be  quite  large  in 
detailed  kinetics  combustion  problems  even  after  convergence  of  the  solution  to  machine  precbion. 

The  level  of  algorithmic  detail  furnbhed  in  the  Appendix  b  sufficient  to  identify  the  five  basic 
tasks  which  together  account  for  admost  adl  of  the  execution  time  required  by  the  code.  In  order  of 
increasing  importance  for  reacting  flow  calculations,  they  aire:  (1)  DAXPY  vector  arithmetic,  (2) 
the  evaduation  of  norms,  (3)  the  solution  of  linear  equations  involving  the  Jacobian  matrix,  (4)  the 
evaduation  of  governing  equation  residuads,  and  (5)  the  evaduation  of  Jcxobiams. 

S.  The  Physical  Model  and  a  Serial  Implementation 
S.l.  The  Continuous  Governing  System 

The  form  of  the  governing  equations  determines  to  a  large  extent  the  choice  of  method  for 
the  linear  adgebra  amd  the  meams  for  evaduating  the  Jacobiam.  For  definiteness,  we  cite  without 
derivation  the  following  set  of  elliptic  boundary  value  problems  (see,  e.;.,[3]).  Though  typical  of 
mamy  reacting  flow  modeb,  they  do  not  include  effects  which  will  in  some  contexts  be  essentiad 
(e.g.,  radiation,  turbulence),  amd  which  could  drasticadly  alter  the  algebraic  form  of  the  governing 
system.  The  problem  motivating  the  present  work  b  a  two-dimensional  axisymmetric  bobaric 
lamiinar  diffusion  flame. 

Let  r  amd  z  denote  the  radial  amd  axiad  directions,  Vr  and  v,  the  respective  velocity  components. 
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auad  p  the  density.  We  introduce  the  compressible  Stokes  streamfunction  V*  such  that 


dif)  dx/j 

fTVr  =  and  prV.  = 


and  the  vorticity 


(j  = 


dVr  dVt 


dz  dr  ’ 

The  species  concentrations  by  mass  are  yjt,  for  /;  =  1,2, where  K  is  the  total  number 
of  species  in  the  reaction  mechanism.  The  temperature  is  T.  These  principal  fields  satisfy  the 
following  equations. 

Streamifunction: 


3/1  di(>\  ^  ^ 

dz  \rp  dz  )  dr  \rp  dr  ) 


+  u;  =  0. 


(3.1) 


Vorticity: 


[h  (f^)  -  Tr  (tI!)]  -  Tr  H  (^))  -  h  H  (?“)) 
+.>,|£4.r’v(:i±^).i8op  =  0. 

Species  Conservation; 

I  (»'‘|7)  - 1  (''‘It)  +  Tr  + 1 


(3.2) 


—rWiiWic  =  0,  fc  =  1,2,. . . ,if. 


(3.3) 


Internal  Energy: 


(3.4) 


Other  parameters  appearing  in  this  system  are:  the  mixture  viscosity  p,  the  acceleration  of 
gravity  g,  the  diffusion  velocities  of  the  species  in  the  mixture  (Vt,,  V*.),  the  specific  heat  of  the 
species  Cpk,  the  specific  heat  of  the  mixture  Cp,  the  thermal  conductivity  of  the  mixture  A,  the 
molecular  mass  of  the  k**  species  Wt,  and  the  generation/consumption  rate  of  the  k*^  species  wt. 
The  system  is  closed  with  the  multicomponent  ideal  gas  law. 


P  = 


pFT 

i2T’ 


where  W  is  the  mixture  molecular  mass  and  p  is  the  pressure.  The  Arrhenius  reaction  rate  law  for 
a  system  of  J  reactions  gives 

y-i 


(3.5) 
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where  the  forward  and  reverse  rates  are  given  by 


(3.6) 


where 

A:/  =  AjT»'exp(-Bj/RT),  kr  =  kj/K],  (3.7) 

and  where,  in  turn  Ay,  the  pre-exponential  rate  constants,  0j,  the  temperature  exponents,  , 
the  activation  energies,  and  Kj,  the  equilibrium  constants,  are  assumed  known.  The  i/jk  are  t^e 
stoichiometric  coefficients  of  the  A;**  species  in  the  j**  resiction.  Superscripts  P  and  R  on  the  i/y* 
denote  the  product  and  reactant  sides  of  the  stoichiometric  equation,  respectively.  Thermodynamic 
and  constitutive  equations  provide  the  temperature-  and  composition-dependent  specific  heats, 
viscosities,  thermal  conductivities,  and  diffusion  velocities. 

For  the  calculation  of  the  and  the  thermodynamic  and  constitutive  properties,  we  use  the 
convenient  FORTRAN  code  packages  documented  in  [6]  and  (7),  respectively.  Forming  the  effective 
mixture  transport  coefficients  (/i.  A,  and  the  Vjt)  by  appropriate  averaging  of  more  fundamental 
quantities  defined  by  the  kinetic  theory  of  gases  or  empirical  tables  is  by  fair  the  most  expensive 
part  of  evaluating  the  governing  equation  residuab.  However,  in  diffusion  flaunes,  accurate  diffusive 
transport  modeb  aure  essential.  One  dimensional  studies  [8]  have  shown,  for  instance,  that  the 
location  of  a  diffusion-controlled  reaction  front  b  shifted  significant  dbtamces  relative  to  the  width 
of  the  fuel-oxidizer  mixing  layer  as  the  exponent  of  temperature  vauriation  in  a  very  simple  transport 
model  b  varied  within  a  smaill  neighborhood. 

3.2.  Dbcretbatlon  of  the  Governing  System 

The  governing  equations,  along  with  appropriate  boundary  conditions,  are  differenced  on  a 
two-dimensional  tensor  product  grid  which  (in  the  serial  code)  b  generated  adaptively  from  am 
initial  coarse  grid  by  subequidbtribution  of  gradients  and  curvatures  of  the  solution  components. 
Thb  concentrates  grid  points  in  the  regions  of  high-activity  (fronts  and  peaks)  in  the  domain. 
Second-order  differences  are  used  throughout  except  for  gradient  boundary  conditions  and  for 
the  convective  terms  of  the  vorticity,  energy  and  species  equations,  in  which  first-order  upwind 
differences  are  employed.  Thb  discretization  can  be  accommodated  within  the  standard  nine-point 
stencil,  and,  apart  from  the  source  term,  insures  the  diagonal  dominance  of  the  Jacobian. 

Ordering  the  solution  components  at  each  gridpoint  within  a  lexicographical  ordering  of  the 
gridpoints  (radid  index  varying  more  rapidly  than  axial  index)  results  in  a  Jacobian  which  has  the 
block  nine-diagonal  structure  indicated  in  Figure  1.  To  quantify  the  overaJl  sparsity,  let  there  be 
m  and  n  gridpoints  in  the  radial  and  axial  coordinate  directions,  respectively,  and  r  unknowns  per 
gridpoint.  The  r  x  r  blocks  must  be  assumed  fully  dense  to  accommodate  the  most  general  kinetic 
mechanbm  and  composition-dependence  of  the  transport  properties.  The  fraction  of  Jacobian 
elements  which  are  nonzero  b  then  approximately  9/(nm).  For  m  and  n  on  the  order  of  30,  which  b 
certainly  below  the  minimum  required  for  well-resolved  reaction  zones,  thb  number  b  approximately 
1%.  It  b  therefore  natural  to  use  a  block  relaxation  method,  in  which  only  the  diagonal  blocks  are 
factored  by  a  d'  . 't  method.  The  relaxation  may  be  carried  out  either  by  throwing  the  six  blocks  in 
the  wings  of  the  Jacobian  to  the  right-hand  side  and  treating  the  three  contiguous  diagonal  blocks 
about  the  main  diagonal  implicitly  (the  block- line  method),  or  of  throwing  all  eight  off-diagonal 
blocks  to  the  right-hand  side  (the  block-point  method).  As  the  outermost  loop  in  the  relaxation 
process  used  to  solve  the  linear  problems  at  each  Newton  step,  the  domain  b  swept  repeatedly 
in  the  axial  direction  from  upstream  to  downstream.  Thb  takes  advantage  of  the  approximately 
parabolic  nature  of  all  of  the  governing  equations  other  than  that  of  the  streamfunction.  Relaxation 
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is,  of  course,  a  suboptimal  algorithm  for  solviog  even  sparse  linear  systems.  In  order  to  progress  to 
better  linear  solvers  as  we  progress  to  denser  meshes,  we  are  presently  experimenting  with  a  block- 
preconditioned  generalized  minimum  residual  (GMRES)  method  [10]  in  our  serial  code.  However, 
as  will  be  noted  below,  the  dominant  cost  of  running  the  code  does  not  necessarily  come  from  the 
linear  algebra,  and  the  block  relaxation  schemes  are  attractive  for  their  simplicity  and  their  ease  of 
parallelization. 

The  complexity  of  the  governing  equations  precludes  analytic  expression  of  the  elements  of  the 
Jacobian;  therefore  a  finite-difference  approximation  to  the  Jacobian  must  be  used.  Extending  the 
ideas  of  Curtis,  Powell  and  Reid  [1],  as  described  in  more  detail  elsewhere  [14],  we  can  form  aill  of 
the  nonzero  elements  from  (9r  -i-  1)  independent  vector  residual  evaluations.  Because  of  the  large 
number  of  function  evaluations  required  to  form  the  Jacobian,  its  formation  may  consume  nearly 
all  of  the  the  running  time  of  the  code.  To  alleviate  this  burden  somewhat,  we  adopt  the  practice  of 
not  re-evaluating  the  detailed  transport  coefficients  during  Jacobian  formation  but  lagging  them  at 
their  vaJues  from  the  previous  Newton  step  residual  evaluation.  The  severity  of  this  approximation 
on  the  nonlinear  convergence  rate  of  the  algorithm  in  comparison  to  the  other  major  Jacobian 
approximations  (the  lagging  of  the  Jacobian  itself  over  several  Newton  steps,  and  its  approximate 
evaluation  from  finite  differences)  has  not  been  studied,  mainly  because  the  control  experiment 
(fresh  evaluation  of  the  transport  terms  with  each  Jacobian)  is  prohibitively  expensive. 

4.  Parallelisation  Analysis 

4.1.  Parallel  Implementation  Issues 

The  large  amount  of  arithmetic  to  be  performed  at  each  gridpoint  requiring  data  defined  only 
at  that  gridpoint  suggests  the  potential  for  easy  parallelization  of  the  detailed  kinetics  reacting 
fiow  computation.  However,  there  are  potentially  three  penalties  to  be  paid  in  distributing  the 
overall  relaxation-based  elliptic  solution  algorithm  over  an  array  of  independent  processors:  syn¬ 
chronization  overhead,  communication  overhead,  and  degradation  of  convergence.  These  penadties 
are  measured  indirectly  through  the  speedup  and  efficiency  figures-of-merit  of  a  parallel  implemen¬ 
tation.  The  speedup  is  the  ratio  of  the  uniprocessor  execution  time  of  a  given  algorithm  to  that 
of  the  multiprocessor  execution  of  the  same  algorithm.  I'he  efficiency  is  the  speedup  divided  by 
the  number  of  processors.  For  many  algorithms,  these  definitions  are  unnatural  in  the  sense  that 
one  would  never  use  the  same  algorithm  in  both  uniprocessor  and  multiprocessor  environments. 
(Usually  better  uniprocessor  algorithms  exist,  so  the  parallel  efficiency  as  strictly  defined  is  infiated 
relative  to  its  advantage.)  We  adopt  measures  in  which  the  execution  times  are  obtained  from  the 
most  natural  algorithm  for  each  environment. 

The  synchronization  penalty  arises  if  the  processors  have  data  dependencies  which  cause  them 
to  cycle  between  computation  and  idle  time.  Even  if  the  processors  are  programmed  homogeneously 
this  can  happen  at  convergence  checkpoints,  for  instance,  or  at  other  points  where  an  exchange 
of  data  is  required,  if  they  have  unequal  amounts  of  work  to  do.  The  amount  of  work  to  be 
done  in  a  processor  is  a  function  of  the  number  of  gridpoints  in  the  subdomain  assigned  to  it, 
the  type  of  discrete  equations  to  be  enforced  at  those  gridpoints,  and  possibly  variations  in  the 
character  of  the  solution  from  one  gridpoint  to  another  at  which  the  same  equations  are  enforced. 
(As  instances  of  the  latter,  consider  the  nondeterministic  arithmetic  complexity  of  obtaining  some 
physical  parameter  required  in  the  governing  equations  by  interpolation  from  a  table  or  by  solution 
of  a  small  nonlinear  system  of  equations  at  each  grid  point.)  Assuming  that  the  grid  points 
are  allocated  to  the  processors  as  evenly  as  possible  (within  shape  and  contiguity  constraints)  at 
the  beginning  of  each  macrocycle  between  grid  adaptations,  synchronization  delays  are  relatively 
unimportant  in  the  particular  problems  we  need  to  consider.  The  relative  number  of  boundary 
gridpoints  (which  require  somewhat  less  computational  work  than  interior  ones)  decreases  as  the 


8 


mesh  is  refined,  atnd  though  their  distribution  between  the  processors  becomes  more  uneven,  only  a 
small  number  of  processors  are  thus  kept  waiting.  Similarly,  the  number  of  arithmetic  operations  at 
the  interior  gridpoints  does  not  vary  significantly  despite  the  widely  varying  solution  in  our  detailed 
kinetics  formulation. 

The  communication  penalty  is  the  time  spent  exchanging  messages  between  neighboring  pro¬ 
cessors  which  depend  on  common  data.  The  significance  of  this  penalty  can  be  great  in  the  present 
context  and  depends  on  the  absolute  amount  of  data  to  be  communicated,  on  the  granularity  of  the 
messages  into  which  it  is  embedded,  on  its  routing  between  processors,  on  the  amount  of  arithmetic 
which  the  algorithm  must  perform  between  exchanges,  and  on  the  communication-to-computation 
speed  ratios  of  the  hardware  in  question.  Two  essentially  different  types  of  exchanges  are  required 
when  the  seriad  code  described  in  the  previous  section  is  parallelized  by  domadn  decomposition. 
For  the  nine-point  second-order  finite-difference  stencil,  evaduation  of  the  governing  equation  resid¬ 
uals  on  domains  decomposed  into  strips  requires  two  (vector)  exchanges  of  data  per  processor  per 
iteration,  and  boxwise  decompositions  require  eight  such  exchanges.  Apairt  from  these  pairwise 
exchanges  between  processors  working  on  adjacent  subdomains,  globad  exchanges  aire  required  for 
the  inner  products.  The  price  of  distributed  memory  is  therefore  heavy  local  traffic  and  light  globad 
traffic. 

In  an  effort  to  keep  the  synchronization  and  communication  overheaul  down,  natural  tradeoffs 
may  exist  which  degrade  the  convergence  rate  of  the  algorithm  by  substituting  reaidily  available 
data  for  the  best  possible  data.  In  the  context  of  relaixation  methods,  decreasing  the  degree  of 
implicitness  from  a  Gauss-Seidel  method  to  a  Jacobi  method  is  a  naturad  way  of  decoupling  an 
inherently  sequentiad  algorithm  into  independent  subtasks.  This  decoupling  may  be  introduced 
gradually  as  a  function  of  the  granularity  of  the  decomposition,  e.g.,  with  two  processors  one  may 
employ  block  Jacobi  between  the  subdomadns  but  Gauss-Seidel  within  the  subdomauns.  It  can 
be  shown  theoretically  [4]  for  the  model  problem  of  the  scalar  Poisson  equation  on  the  uniformly 
gridded  unit  square  that  the  asymptotic  convergence  rates  of  the  the  finest  grainulatrity  point  or 
line  Jau:obi  methods  are  half  those  of  the  respective  Gauss-Seidel  methods,  so  that  twice  as  many 
iterations  aure  required  for  an  equal  error  reduction.  Therefore,  am  approximate  upper  bound 
on  the  practicad  pairadlel  efficiency  for  a  processor-saturated  Jacobi  method  applied  to  the  model 
problem  is  50%.  However,  the  lineair  systems  which  obtain  in  the  reacting  flow  context  are  not 
necessarily  diagonally  dominamt  amd  must,  from  experience,  be  under-relaxed.  The  convergence 
penadty  incurred  when  under-relaxed  iterations  are  decoupled  is  less  than  that  borne  by  fully- 
relaxed  iterations,  as  the  test  data  of  section  5  confirms.  Therefore,  efficiencies  of  greater  tham  50% 
at  the  processor-saturated  limit  are  certainly  not  ruled  out  by  this  tradeoff. 

In  view  of  the  above  considerations  as  well  aa  programming  convenience,  our  pauadlel  imple¬ 
mentation  consists  of  decomposing  the  logical  tensor  product  computational  domain  into  logically 
congruent  strips  or  boxes  of  contiguous  unknowns,  mapping  these  subdomains  in  as  neighbor¬ 
preserving  a  manner  as  possible  onto  network  of  processors,  and  programming  the  processors  ho¬ 
mogeneously.  This  equidistributes  load,  minimizes  the  size  and  distance  of  the  messages,  and 
simplifies  the  communication  patterns.  For  stripwise  decompositions,  the  natural  candidate  map¬ 
pings  are  onto  a  ring  of  processors,  a  binary-reflected  Gray  code  ring  embedded  in  a  hypercube,  and 
a  bus-connected  shared  memory  machine.  For  boxwise  decompositions,  the  natural  mappings  are 
onto  a  toroidal  mesh  of  processors,  a  binary-reflected  Gray  code  mesh  embedded  in  a  hypercube, 
and  a  bus-connected  shared  memory  machine.  The  distributed  memory  mappings  are  chosen  so 
that  subdomains  sharing  common  sides  are  assigned  to  processors  which  are  directly  connected. 
The  only  algorithmically  significant  difference  between  the  distributed  memory  architectures  for  a 
given  decomposition  is  then  in  their  overall  diameters,  that  is  the  maximum  distance  in  message 
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links  between  any  two  processors,  since  all  processors  must  cooperate  in  inner  products  and  con¬ 
vergence  decisions.  For  the  ring,  toroidal  mesh,  and  hypercube,  these  diameters  are  p/2,  y/p,  and 
logsp,  respectively.  See,  e.^.,  (11,  5]  for  fuller  descriptions  of  these  interconnection  topologies. 

4.2.  Parallel  Complexity  Analysis 

A  complexity  analysis  of  a  fairly  involved  computational  procedure  is  most  clearly  approached 
in  a  modular  fashion.  Accordingly,  we  begin  by  constructing  operation  counts  for  the  five  major 
computational  subtasks  listed  at  the  end  of  section  2,  then  combine  them  in  proportion.  A  total 
of  15  parameters  are  needed  to  descri’^e  a  model  with  sufficient  versatility  to  cover  the  region  of 
parameter  space  in  which  we  would  like  to  use  the  model  now  and  in  the  neax  future.  Better 
complexity  models  involving  yet  further  parameters  are  possible,  but  are  not  necessary  to  obtain 
the  basic  order-of-magnitude  results  sought. 

We  begin  with  a  fairly  general  model  for  the  computer  itself,  since  specification  of  the  design 
of  hardware  well-suited  to  parallel  reacting  fiow  computations  is  one  of  our  goals.  A  network  of  p 
homogeneous  processor  elements  is  assumed,  each  of  which  has  sufficient  local  memory  to  represent 
the  data  of  the  associated  subdomun.  We  define  7  as  the  unit  of  scalar  floating  point  operation 
time.  To  be  precise,  we  count  the  scalar  multiply-add  operation  as  one  7.  A  scalar  division  is 
counted  with  the  same  weight,  and  an  a  scalar  exponential  is  assumed  to  require  67.  Any  vector¬ 
processing  capability  of  these  elements  is  ignored.  (The  savings  of  vectorization  eire  not  negligible, 
even  as  regards  the  transport  and  source  terms  f,  but  taking  advantage  of  this  capability  would 
require  substantial  recoding  of  our  present  packages,  along  with  a  more  complicated  complexity 
model.) 

The  processors  communicate  by  passing  messages  or  accessing  shared  memory  through  a  global 
bus,  depending  upon  the  interconnection.  We  define  or  as  the  time  overhead  to  initiate  and  route 
a  message  of  any  length  between  a  pair  of  neighboring  processors  (which  we  neglect  entirely  for 
the  shared  memory  machine)  and  0  as  the  additional  transfer  time  per  floating  point  word  (which 
is  simply  the  reciprocal  of  the  bus  bandwidth  for  the  shared  memory  machine).  For  simplicity, 
we  assume  that  only  sending  messages  (or  depositing  data  in  the  shared  memory)  takes  time. 
Receiving  (or  reading  data)  is  assumed  to  be  free.  Since  all  sent  messages  are  presumed  received 
at  some  point,  any  additional  reception  cost  which  ought  to  be  included  to  model  a  particular 
machine  can  be  incorporated  by  simple  readjustment,  of  a  and  P  (e.g.,  doubling  them  if  the  two 
costs  are  comparable).  We  define  three  architecture-dependent  functions  of  p:  Cex,»  C'ei.o 
the  local  exchange  coefficients  for  processors  sharing  neighboring  sides  and  corners,  respectively, 
and  Cftej  global  reduction  coefficient  These  coefficients  multiplying  a  and  0  take  into  account 
the  degraded  ability  of  the  network  to  exchange  data  as  the  number  of  processors  grows.  In  the 
case  of  the  ring,  toroidal  mesh,  and  hypercube,  Cex,i  is  one  regardless  of  the  number  of  processors 
because  there  are  channels  solely  devoted  to  these  message  routes  in  the  network.  In  the  case  of  the 
shared-memory  machine,  the  bandwidth  to  memory  per  processor  must  be  divided  by  the  number 
of  processors;  hence  Cex,»  is  p  itself.  In  the  case  of  the  toroid2Ll  mesh  and  hypercube,  the  two 
distributed  memory  configurations  .'or  which  it  makes  sense  to  introduce  boxwise  decompositions 
(and  thus  neighbors  at  the  corners),  Cex^c  is  2.  This  is  because  there  are  no  devoted  channeb  for 
these  message  routes  and  each  corner-to-corner  message  must  be  routed  through  a  mutual  neighbor. 
(Thb  b  a  conservative  assumption;  we  ignore  the  fact  that  in  some  systems,  the  overhead  of  sending 
an  k-hop  message  b  less  than  k  times  the  overhead  of  sending  a  direct,  or  1-hop  message.)  On  a 
bus-connected  machine,  messages  have  no  preferential  channeb,  hence  CEx,e  b  the  same  as  Cex,$, 
namely  p.  The  global  reduction  operation  produces  a  single  scalar  from  scalars  defined  on  each 
processor,  whether  by  addition  in  the  case  of  a  global  inner  product  whose  partial  sums  are  locally 
accumulated,  or  by  the  logical  "and”  operation  in  the  case  of  a  convergence  test.  Our  reduction 
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algorithms  (which,  though  simple,  are  beyond  the  scope  of  this  study;  see  [11,  5]  for  a  fuller 
discussion)  are  close  to  optimal  for  a  reasonable  “common  denominator”  of  network  software.  For 
the  ring,  toroidal  mesh,  and  hypercube,  Cju  is  twice  the  diameter  of  the  network;  it  is  2p  for  the 
shared-memory  machine. 

The  complexity  model  depends  in  addition  on  basic  discrete  problem  dimensions:  the  resolution 
of  the  grid  n  (hereafter  assumed  the  same  in  each  direction  for  simplicity),  the  number  of  points  in 
the  finite  difference  stencil  q,  the  number  of  components  (or  governing  equations)  per  gridpoint  r, 
and  the  number  of  parameters  per  gridpoint  a.  (Though  q  is  “hard-coded”  as  nine  in  the  present 
context,  the  formulae  below  are  also  valid  for  second-order  schemes  without  cross-derivatives,  for 
which  q  —  h.  Generalization  to  higher-order  schemes  is  not  automatic,  since  an  expanded  discrete 
stencil  would  require  that  the  processors  share  not  only  the  boundary  data  of  their  associated 
subdomains,  but  successively  more  interior  data  as  well.  ) 

In  analogy  to  the  machine-dependent  coefficients,  there  are  coefficients  multiplying  the  basic 
problem  dimensions  which  depend  on  the  detaib  of  the  physics  and  chemistry  incorporated  into 
the  governing  equations.  A  wide  variety  of  such  physical  models  can  be  represented  by  simply 
distinguishing  between  three  types  of  terms:  a  per-stencil-point-per-component  operation  count 
CcD  (whose  main  contribution  is  from  the  convective-diffusive  terms),  a  per-component  operation 
count  Cat  (whose  main  contribution  is  from  the  Arrhenius  source  ternu),  and  a  per- parameter 
operation  count  Ctt  (into  which  all  of  the  transport  property  calculations  at  eaudi  grid  point  are 
lumped). 

In  terms  of  these  parameters,  the  major  algorithmic  subtasks  have  the  following  complexities. 
DAXPYs  (strips  or  boxes): 


The  DAXPY  requires  no  communication  and  parallelizes  perfectly  over  any  partitioning. 
Global  Norms  (strips  or  boxes); 


7  +  jC'ile)  O'  +  \^Rt] 


The  norm  requires  local  inner  products  followed  by  a  global  reduction  of  the  partial  sums  with 
addition.  In  the  spirit  of  order-of-magnitude  estimation,  here  amd  below,  only  arithmetic  which  is 
done  at  every  gridpoint  will  be  considered.  (Thus,  we  do  not  consider  the  compauratively  negligible 
cost  of  normalizing  the  norm,  for  instamce.)  However,  scalar  communication  can  be  nearly  as 
expensive  as  vector  communication  when  a  is  large.  Thus,  no  communication  is  negligible  a  priori. 

Block  Relaxation  ^  -trips): 


j  ^  ®  +  l2nrCEx,s  +  Cjie] 

Block  Relaxation 


(gr*  +  4r) 


7  +  [ACex.»  +  (g  -  5)CE,,e  +  CRelof  + 


+  {q-  ^)rCEx,e  +  C'r* 
n/P 


0. 


The  arithmetic  cost  of  eaudi  sweep  of  the  block  relaxation  reflects  the  formation  of  a  modified  right- 
hand  side  (which  includes  a  term  in  from  eaK;h  Jacobian  block  which  is  not  treated  implicitly). 
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back-substitution  using  pie-factored  Jacobian  blocks  (either  block  tridiagonal  or  block  diagonal, 
additional  terms  in  r^),  a  DAXPY  update  to  the  solution  vector,  and  a  convergence  check.  In 
the  case  of  the  stripwise  decomposition,  each  pair  of  neighboring  processors  must  exchamge  their 
bounding  rows  of  data  consisting  of  n  gridpoints.  Neighbors  in  boxwise  decompositions  are  of  two 
types:  those  that  must  exchange  bounding  rows  of  data  consisting  of  n/^  gridpoints,  and  those 
that  must  exchange  corner  points  of  data  only.  (The  latter  drop  out  when  a  5-point  stencil  b 
employed.) 

Function  Evaluation  (strips): 


—{CcD^r  +  CArf  +  CtfS) 
P 


'1  +  a  +  [2n(r  +  s)^^,,,] 


Function  Evaluation  (boxes): 


—  {CcD^f^  +  CatT  +  C^pS) 
P 


1+ 


[^Ex,*  +  (?  -  5)C'E*,e]a  -h  ^)^Ex,x  +  (?  “  5)(r  +  «)(?£;*, ej  0. 

The  function  evaluation  arithmetic  cost  b  independent  of  decomposition,  but  the  communication 
requirements  are  different  for  the  same  geometrical  reasons  discussed  in  the  block  relaxation  step. 

Jacobian  Evaluation  (strips): 


((?*■  +  +  Catt)  +  2qr*  +  2r  -i-  7+ 

[(2(gr  +  1)  +  (g  -  3))C£:,,,]a  +  [(2(gr  +  l)n(r  -I-  s)  +  (?  -  3)nr*)C£,,,]  /9. 


Jacobian  Evaluation  (boxes): 


_2  /  1 
—  Ugr  +  l)iCcDgr  +  Catt)  +  2qr^  -H  2r  -I-  -r® 


)] 


[4{qr  +  2)Ce,^  +  (?  -  5)(?r  +  2)C£:,,e]o-l- 


4-~{{qr  -H  l)(r  -»-  s)  +  r^)C£,.,  +  (q  -  5)((ffr  -I-  l)(r  +  s)  -f-  r^)CEx,e 
.  VP 


0. 


The  Jacobian  evaluation  cost  includes  {qr  4-  1)  independent  function  evaluations,  in  which  the 
dependence  on  the  detailed  transport  through  Ctt  is  neglected,  along  with  some  finite  difference 
arithmetic  performed  on  the  function  evaluations  to  generate  the  matrix  elements,  and  some  pre¬ 
processing  which  b  ^unortbed  over  all  of  the  block  relaxation  sweeps  utilizing  the  Jacobiam.  The 
stripwise  method  preprocessing  cost  of  7r^/3  reflects  an  LU-block  factorization  of  the  diagonal 
blo^  and  the  solution  of  r-dimensional  linear  systems  involving  the  columns  of  the  two  neighbor¬ 
ing  off-diagonad  blocks.  In  the  boxwbe  method  (the  most  fine-grained  limit  of  which  b  block-point 
relatxation)  the  preprocessing  cost  b  just  r^/3  per  gridpoint  for  fau:torization  of  the  diagonal  block. 
The  communication  terms  include  both  the  shairing  of  subdomain  boundary  data  needed  for  the 
function  evaduation  and  the  shipping  of  Jau;obiam  blocks  between  neighboring  processors.  The  latter 
b  necessauy  in  order  to  exploit  the  Curtb-Powell-Reid  (CPR)  technique  refered  to  in  section  3.  In 
the  CPR  method,  it  b  easier  to  compute,  e.j.,  at  gridpoint  (i,j)  tham  it  b  to  compute 

A  F‘ 

~ ,  where  Fij  represents  any  of  the  governing  equation  residuals  at  the  point  and  amy  of 
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9F' 

the  unknowns.  However,  it  is  which  is  actually  needed  locally  for  the  linear  algebra.  K  the 

point  (t,j)  in  question  lies  at  the  low-j  end  of  a  subdomain  then  this  quantity  is  formed  at  the 
high-j  end  of  the  neighboring  subdomain  as  and  so  forth,  then  the  blocks  are  swapped.  This 

approach  is  justified  since  the  block-swapping  requires  no  arithmetic  and  it  does  not  contribute  to 
the  leading  communication  term  in  a  when  r  is  large.  The  extra  function  evaluations  which  would 
otherwise  be  required  would  contribute  to  the  leading  terms  in  eamh  of  the  brackets  multiplying  7, 
a,  and  p. 

If  the  linear  systems  involving  the  Jacobian  are  solved  to  a  level  of  convergence  which  is 
independent  of  the  granularity  of  the  decomposition,  then  the  number  of  hybrid  Newton  steps  is 
independent  of  the  type  of  decomposition  and  the  number  of  processors,  f  We  may  therrfore 
use  the  hybrid  Newton  step  as  the  largest  aggregate  in  the  complexity  model,  and  work  out  the 
complexity  on  a  per-call-to-NEWTON  basis.  Neglecting  the  overhead  associated  with  possible 
problem-dependent  damping  and  selection  of  time  steps,  we  can  state  that  each  hybrid  Newton  step 
consists  of  one  DAXPY,  two  norms  (one  of  the  solution  vector  update  for  determining  convergence, 
and  one  of  the  steady-state  residual  norm  for  additional  monitoring),  one  function  evaluation, 
L  relaxation  sweeps,  and  \./K  Jacobian  evaluations.  The  last  two  parameters  of  the  model  are 
therefore  the  average  number  of  relaxation  sweeps  per  Newton  step,  L,  atnd  the  average  number 
of  Newton  steps  between  Jacobian  evaluations,  K.  For  systems  as  complicated  as  (3.1) -(3.7),  no 
theory  exists  on  either  L  01  K.  We  must  therefore  obtain  L  aind  K  from  problem-specific  serial 
experiments,  and  be  content  to  extrapolate  them  into  other  regions. 

To  illustrate  the  usefulness  of  the  complexity  analysis  while  keeping  the  size  of  this  paper  within 
reason,  it  is  convenient  to  focus  on  individual  points  in  physical  parameter  space  and  to  study  the 
parallelizability  as  a  function  of  n,  p,  and  the  dimensionless  ratios  a  s  a/7  and  6  =  ^one. 
For  this  purpose,  we  adopt  the  specific  16-species,  46-reaction  mechanism  of  [9]  for  the  oxidation 
of  methane.  We  then  have  for  the  basic  problem  dimensions:  9  =  9,  r  =  19  (with  unknowns 
. . .  ,Yie),  and  s  =  20  (with  parameters  p,Cp,p,A,Di,D2,...,i>i6).  The  operation 
count  coefficients  are  approximately:  Ccd  ^  CUr  ^  270,  and  Crr  2000.  The  coefficient 
is  obtained  directly  from  counting  the  operations  required  to  evaluate  the  species  and  energy  source 
terms,  taking  the  sparsity  of  the  sums  and  products  of  (3.5)- (3.7)  into  etccount,  and  dividing  by 
r.  The  other  two  coefficients  are  too  difficult  to  estimate  directly  from  the  code  itself,  so  they  are 
fitted  to  Cji,  for  a  particulair  q,  r,  and  s  by  the  simple  formula: 

_  <^CDQr  _  CrrB 
•Mr  7  CD  Tfr 

where  Tat,  Tcoy  amd  Tjr  are  actual  serial  CPU  timings  of  the  respective  portions  of  the  code  which 
performs  the  function  evaluation.  L  200  and  5  are  also  available  from  actual  runs  of  the 
serial  code  on  a  grid  of  approximately  1200  nonuniformly-spaced  points. 

4.3.  Parallel  Performance  Analysis 

With  the  preceding  formulae  and  assigned  values,  we  can  estimate  the  execution  time  T  of  the 
parallelized  solver  (normalized  to  the  floating  point  speed  of  the  processors)  through  a  relation  of 
the  form: 

-  =  A(«iP)  +  <*/a(»,p)  +  bffi{n,p).  (4.1) 

Under  the  additional  assumptions  that  a  s  lOO  and  that  5=1  the  leading-order  communication 
terms  of  can  easily  be  extracted,  leading  to  the  entries  under  A,  afa,  and  bf^  in  Table  1.  Minimizing 

tThi*  condition  mmy  not  be  esttofied  for  reiaxntion  method*  in  which  convergence  U  baaed  solely  upon  the  alse  of  the 
update,  since  the  deteriorating  condition  number  of  the  iteration  matrix  with  finer  granularity  may  allow  a  larger  residual  at 
convergence.  However,  this  has  a  relatively  small  effect  on  the  total  number  of  Newton  steps. 
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T/7  as  a  function  of  p  leads  to  the  optimal  execution  time,  Topt,  for  a  given  n,  a,  and  6.  The  number 
of  processors  which  minimizes  for  each  of  the  six  mappings  described  above  b  given  in  the 
fourth  column  of  Table  1  in  orders  of  n.  The  fifth  column  contains  the  actual  optimum  number 
of  processors  for  the  case  n  =  64  when  all  the  terms  of  the  complexity  model  ue  taken  into 
account,  and  the  last  column  shows  the  fraction  of  the  total  execution  time  which  b  devoted  to 
communication  at  thb  optimum.  (The  value  n  =  64  b  selected  because  one-dimensional  studies  of 
methane-air  diffusion  flames  [8]  suggest  that  thb  should  be  sufficient  resolution  in  either  direction 
for  approximate  grid-independence  of  the  discrete  solution.) 


Mapping 

A/io* 

wmmim 

WESESM 

Popt,d 

Popt 

comm.  frac. 

strips/cube 

MWWSSM 

4.0  logjp 

1.1  n 

0(»») 

64 

0.014 

strips/ring 

■QySI 

2.0  p 

1.1  n 

0(n) 

64 

0.028 

strips/bus 

- 

1.1  np 

64 

0.39 

boxes/cube 

1.1  nV* 

4.0  logap 

mssm 

4096 

boxes/mesh 

1.1  n^p~^ 

4.0  p^'* 

3686 

boxes/bus 

1.1  n^p~^ 

- 

Thble  1:  Optimal  processor  estimates  for  six  subdomaiin-to- 
architecture  mappings,  and  estimates  of  the  fraction  of  the 
total  execution  time  devoted  to  communication  at  the  opti¬ 
mal  p,  assuming  n  =  64,  a  =  100,  and  6=1. 

In  the  case  of  stripwbe  decomposition  on  a  hypercube,  the  optimal  order  of  b  too  optimbtic, 
since  it  the  algorithm  can  only  employ  as  many  processors  as  there  are  strips.  (By  construction, 
1  <  p  <  n  for  the  stripwbe  decompositions  and  1  <  p  <  for  the  boxwbe  decompositions.)  The 
communication  fraction  remains  very  small  (less  tham  2%)  when  thb  decomposition  b  processor- 
saturated  at  p  =  n  =  64.  From  a  user  perspective,  thb  order  of  speedup  reduces  an  hour  of  serial 
execution  time  to  approximately  one  minute. 

On  a  ring,  the  communication  fraction  for  processor-saturated  case  b  only  slightly  worse  (about 
3%).  The  only  difference  in  complexity  between  the  two  cases  b  a  larger  global  reduction  time  due 
to  the  fact  that  the  ring  interconnect  b  less  rich.  However,  global  reductions  form  a  small  proportion 
of  the  total  running  time  in  thb  region  of  parameter  sp8u:e. 

In  the  case  of  stripwbe  decomposition  on  a  bus-connected  shared  memory  machine  the  bus 
becomes  choked  with  local  exchanges  as  p  increases  leading  to  an  optimal  order  of  v}!^.  However, 
thb  b  too  pessimbtic  for  reasonable  values  of  n  and  the  chosen  machine  parameters  since  the 
constant  in  front  b  large  and  the  optimum  actually  lies  on  the  boundary.  Although  the  execution 
time  continues  to  drop  as  processors  are  added  to  the  choked  bus,  the  efficiency  drops  considerably. 
In  the  processor-saturated  case,  nearly  40%  of  the  time  b  devoted  to  communication. 

In  the  case  of  boxwbe  decomposition  on  a  hypercube,  the  optimal  order  estimate  of  v?  (which 
b  derived  from  the  balance  between  and  /«  alone)  b  justified  by  the  full  model,  in  that  64^  = 
4096  processors  can  usefully  be  employed.  However,  at  thb  processor-saturated  limit  the  speedup 
b  limited  since  over  40%  of  the  time  b  devoted  to  communication.  (For  a  =  100  and  6=1, 
the  communication  b  dominated  by  the  global  reduction  operation.)  Nevertheless,  from  a  user 
perspective,  a  processor  airray  of  thb  sbe  could  reduce  about  40  minutes  of  serial  execution  time 
to  approximately  one  second. 

The  other  architectures  do  not  perform  as  well  for  the  boxwbe  decomposition.  The  difference 
between  the  2  log2  p  global  reduction  coefficient  for  the  hypercube  and  the  corresponding  2y/p  for 
the  mesh  b  significant  for  p  on  the  order  of  2’^,  and  the  optimal  order  falb  to  Thb  b  somewhat 
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pessimistic  for  the  problem  under  consideration  because  the  const2mt  in  front  is  greater  than  one, 
and  the  execution  time  is  actually  minimued  at  p  =  3686.  However,  communication  costa  70%  of 
the  execution  time  at  this  point. 

The  bus-connected  shared  memory  machine  becomes  hopelessly  choked  with  a  mere  206  pro¬ 
cessors  under  this  decomposition.  Further  increases  in  the  number  of  processors  actually  increase 
the  execution  time.  At  this  point,  over  50%  of  the  execution  time  is  spent  swapping  data.  The 
optimad  order  of  is  not  much  better  than  in  the  stripwise  decomposition  case. 

The  optimum  number  of  processors  has  been  defined  above  as  the  number  which  minimizes 
the  execution  time.  However,  the  ‘‘optimum”  might  actually  be  smaller  if  the  marginal  efficiency 
of  the  processors  is  small  and  a  cost-weighted  objective  function  is  used.  Because  of  this  important 
consideration,  we  conclude  this  section  with  the  presentation  of  surface  plots  of  efficiency  over  a 
range  of  of  n  and  p  for  the  complexity  model  defined  above.  (For  plotting  clarity,  the  efficiency  is 
set  to  zero  in  the  inaccessible  ranges  of  p  >  n  for  stripwise  decompositions  or  p  >  for  boxwise 
decompositions.)  Results  are  given  for  all  six  mappings  for  the  machine  pairameters  a  =  100  amd 
6=1,  and  for  the  two  hypercube  mappings  (which  are  the  most  promising)  at  a  =  1000  and  6=1, 
in  order  to  check  the  sensitivity  of  their  performance  to  the  message  initiation  overheaul.  Figure  2 
contadns  the  strip-based  mappings,  and  Figure  3  the  box-based  mappings. 

It  is  apparent  from  Figure  2(d)  that  the  hypercube/strip  mapping  can  bury  the  cost  of  initiating 
messages  over  a  rather  slow  network,  whereas  the  performamce  of  hypercube/box  mapping  shown 
in  Figure  3(d)  has  deteriorated  to  possibly  uneconomicad  levels.  Further  calculations  (not  plotted) 
show  that  the  hypercube/strip  mapping  maintauns  greater  than  70%  efficiency  even  at  a  =  10,000. 

In  all  the  calculations  of  this  section,  L  was  set  to  the  constamt  vatlue  200  obtained  from  serial 
runs  with  the  block-line  algorithm.  This  crude  akssumption  potentiadly  endangers  the  results,  making 
them  appear  too  optimistic  at  lairge  n  and  p,  since  L  appeaurs  lineaurly  in  the  leading  terms  of  the 
communication.  (It  does  not  appear  in  the  leading  terms  of  the  aurithmetic,  since  the  Jacobian  amd 
function  evaluations  dominate.)  By  analogy  with  a  scalaur  Poisson  equation,  L  might  be  expected 
to  increase  like  at  a  fixed  p,  by  as  much  as  a  factor  of  2  at  fixed  n  over  the  full  ratnge  of  p  for 
the  stripwise  decomposition,  amd  by  as  much  as  a  factor  of  4  over  the  full  range  of  p  in  the  boxwise 
decomposition  case  (see,  e.g.,  [4]).  However,  experience  indicates  that  these  scalaur  Poisson-based 
iteration  estimates  au'e  pessimistic  for  the  Jatcobiams  of  two-dimensional  strongly  convective  systems 
of  reaurting  flow  equations.  Still,  to  be  conservative,  these  optimal  p  estimates  should  be  regarded 
as  valid  only  in  the  neighborhood  of  the  paurauneter  spaure  in  which  they  were  derived.  In  particulau-, 
they  axe  not  uniformly  valid  in  n.  In  order  to  model  performance  conservatively  over  a  wide  region 
of  pairameter  spaure  with  the  convenient  assumption  of  a  constant  L,  the  worst-case  L  for  the  region 
(based  on  its  extreme  point  in  n  and  p)  should  be  used.  This  number  is  not  yet  experimentally 
available  to  us  for  n  =  64,  the  extreme  resolution  in  Figures  2  and  3. 

S.  Parallel  Exparimants  for  Modal  Problams 

The  solution  algorithm  described  in  section  2  has  been  implemented  on  the  32-processor  Intel 
iPSC-5  (with  the  expamded  local  memory  configuration)  and  some  results  are  presented  in  this 
section  for  a  few  simple  choices  of  F(^).  These  results  serve  three  main  purposes:  (i)  they  validate 
the  parallel  code  itself,  (ii)  they  provide  a  test  for  the  complexity  theory  of  section  4  in  the  case  of 
constamt  £,  and  (Hi)  they  illustrate  that  the  dependence  of  the  overall  performamce  of  the  code  on 
variations  in  L  with  n  or  p  is  wealcer  than  one  would  expect  from  the  model  problem,  under  various 
conditions  which  typify  our  reacting  flow  calculations.  These  conditions  aire:  under-relaucation  of 
the  lineam  solver,  strong  (upwinded)  convection,  and  lairge  r  with  strong  coupling  of  the  r  unknowns 
at  eatch  point. 

The  model  Poisson  problem  is 


-V*«  =  / 
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on  the  unit  square  with  homogeneous  Dirichlet  boundary  conditions,  where  f{x,y)  is  chosen  so 
that  the  exact  solution  is  u(x,y)  =  sin(irx)sin(iry).  The  initial  iterate  is  u  =  0.  The  stripwise 
decomposition  is  applied  to  a  uniform  32x32  grid,  using  1, 2, 4,  8, 16,  and  32  processors.  Because  the 
problem  is  linear,  convergence  is  obtained  in  one  Newton  step.  An  absolute  convergence  toleramce  of 
10~*  on  the  normalized  2-norm  of  the  update  is  applied  to  the  linear  iteration,  in  which  a  relaxation 
factor  of  unity  is  employed. 

The  results  appear  in  Table  2.  The  first  three  rows  display  convergence  data:  the  number 
of  iterations  required  for  a  given  granularity  of  the  decomposition,  this  number  normalized  to  the 
uniprocessor  case,  and  the  relative  reduction  of  the  2-norm  of  the  residual  at  “convergence” .  One 
observes  that  the  number  of  iterations  required  for  the  processor-saturated  p  —  32  (block-line 
Jau:obi)  case  b  nearly  double  that  of  the  p  =  I  (block-line  Gauss-Seidel)  case.  It  is  less  than 
the  theoretical  asymptotic  factor  of  two  since  this  factor  is  based  on  equal  error  reduction,  and 
such  a  condition  is  not  guaranteed  with  an  update-based  convergence  criterion.  The  next  three 
rows  give  overall  performance  measures:  the  execution  time  in  seconds  (which  includes  a  single 
Jacobian  formation  at  the  outset),  the  relative  speedup  obtained  from  each  doubling  of  the  number 
of  processors  (ideally  2  each  time),  and  the  efficiency  (E  =  Ti/(pTp)).  Though  the  speedup  is 
respectable  (from  a  little  over  half-an-hour  to  three  minutes),  the  efficiency  is  fairly  dismal  (32%) 
at  the  processor-saturated  limit  due  to  two  factors:  the  large  communication-to-arithmetic  ratio 
of  the  r  s  1  problem  and  the  increase  of  L  with  p.  (The  average  number  of  relaixation  sweeps 
per  Newton  step  L  is,  of  course,  the  same  as  the  tabulated  I  for  this  linear  problem.)  The  latter 
dependence  can  be  removed  by  considering  the  execution  time  per  iteration.  Performamce  data 
based  thereon,  amalogous  to  the  middle  three  rows,  appears  in  the  last  three  rows  of  Table  2. 
By  this  measure,  the  processor-saturated  efficiency  is  58%,  amd  its  departure  from  unity  is  due 
exclusively  to  communication-related  factors. 


p=  1 

2 

4 

8 

16 

32 

Ip 

451 

477 

501 

550 

646 

832 

hll\ 

1.00 

1.06 

1.11 

1.22 

1.43 

1.85 

lk/,||/||ro|| 

9.5(-4) 

1.0(-3) 

1.2(-3) 

1.4(-3) 

1.6(-3) 

2.0(-3) 

Tp  (sec.) 

1840. 

974.4 

539.3 

314.1 

Tp-ilTp 

- 

1.89 

1.81 

1.71 

E 

1.00 

0.94 

0.85 

0.32 

tp  (sec.) 

4.08 

2.04 

1.08 

0.33 

0.22 

^p-iAp 

- 

2.00 

1.89 

1.89 

1.73 

1.50 

e 

1.00 

1.00 

0.94 

0.89 

0.77 

0.58 

Table  2:  Intel  Hypercube  data  for  the  model  linear  problem, 

—  V*«  =  /  in  the  unit  square,  with  u;  =  1.0,  for  different  num¬ 
bers  of  processors,  p.  Ip  b  the  number  of  iterations  required 
for  convergence,  ||%||/||rol|  b  the  ratio  of  final  to  initial  resid- 
uab,  Tp  b  the  execution  time,  E  b  the  overall  efficiency  of  the 
p-processor  algorithm,  tp  b  the  per  iteration  CPU  time,  and 
e  b  the  per  iteration  efficiency  of  the  p-processor  algorithm. 

The  last  row  of  the  table  can  be  directly  compared  with  a  theoretical  efficiency  surface  plot 
for  the  hypercube/strip  mapping  of  the  model  problem,  which  appears  in  Figure  4(a).  Thb  plot 
was  generated  using  the  complexity  model  of  the  previous  section  with  9  =  9  r  =  1,  s  =  1, 

tThoufh  th«  mod«i  probtem  could  be  differenced  with  •  flvo-point  formula,  the  nin^point  Jacobian  ie  preeently  hard-coded 
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CcD  ~  1-3,  Cxr  =  29  I,  Cxr  =  Q,  L  —  451,  and  K  =  1,  along  with  a  =  1000  fisec,  =  8  /isec,  and 
7  =  40  ^sec  or  a  =  25,  6  =  0.2.  An  interesting  trend  which  is  barely  discernable  in  Figure  2  (b) 
and  (c),  but  readily  apparent  in  Figure  4(a)  is  the  increase  in  parallel  eflSciency  with  n  along  the 
processor-saturated  diagonal  (p  =  n).  The  leading-order  arithmetic  terms  grow  faster  with  n  than 
the  leading-order  communication  terms  for  the  hypercube/strip  mapping,  so  that  paradlel  efficiency 
improves  with  grid  refinement.  We  emphasize  that  this  is  an  artifact  of  a  constant  L. 

Figure  4(b)  contains  two  sections  of  the  surface  in  (a),  showing  the  theoretical  efficiency  at 
two  fixed  values  of  n  as  p  varies  from  1  to  n.  Plotted  as  circles  on  (b)  are  the  data  from  the 
last  row  of  Table  2  and  analogous  results  on  an  n  =  16  problem  (not  separately  tabulated).  The 
agreement  of  the  crude  complexity  model  with  experiments  is  better  tham  one  has  a  right  to  expect. 
We  emphasize  that  we  would  never  use  a  relaxation  algorithm  to  solve  this  simple  problem,  even 
in  parallel,  since  many  efficient  alternative  solution  techniques  exist  and  have  been  parallelized 
(FFTs,  multigrid,  conjugate-gradient- based  domain  decomposition,  etc.).  However,  it  is  a  useful 
test  problem  since  reacting  fiow  problems  often  possess  regions  in  which  diffusion  dominates  or 
co-dominates.  In  addition,  the  streamfunction  equation  is  simply  a  variable-coefficient  Poisson 
equation. 

Table  3  illustrates  the  decline  in  the  iteration  penalty  associated  with  loss  of  implicitness  when 
the  updates  are  under-relaxed.  The  problem  is  the  same  as  the  previous  one,  except  that  cj  =  0.5, 
which  is  the  value  we  are  typically  forced  to  use  in  solving  systems  based  on  the  reacting  flow 
Jacobian.  Though  the  total  number  of  iterations  is  higher  at  all  granularities,  it  grows  more  slowly, 
reaching  a  maximum  of  1.28  times  the  uniprocessor  total,  rather  than  1.85  for  the  fully-relaxed 
calculation  in  Table  2.  The  overall  parallel  efficiency  is  accordingly  higher.  The  per-iteration 
efficiency  data  is  identical  (to  within  the  precbion  of  the  table)  to  that  of  Table  2,  since  L  b 
normalized  out,  and  therefore  is  not  repeated  here. 


p=  1 

2 

4 

8 

16 

32 

1192 

MSSM 

1235 

1278 

1362 

1529 

hlh 

1.00 

■ESI 

1.04 

1.07 

1.14 

1.28 

lk/J|/l|ro|| 

2.9(-3) 

3.2(-3) 

3.3(-3) 

3.5(-3) 

4.0(-3) 

Tp  (sec.) 

4870. 

2485. 

1334. 

335.2 

Tp-jTp 

- 

1.95 

1.86 

1.35 

E 

1.00 

0.98 

0.91 

0.67 

0.45 

l^bl«  S:  Intel  Hypercube  data  for  the  model  linear  problem, 

—  V^ti  =  /  in  the  unit  square,  with  w  =  0.5,  for  different 
numbers  of  processors,  p.  Ip  is  the  number  of  iterations  re¬ 
quired  for  convergence,  ||r/,||/||ro||  b  the  ratio  of  final  to  ini¬ 
tial  residuab,  Tp  b  the  execution  time,  and  E  b  the  overall 
efficiency  of  the  |>-proces8or  algorithm. 

Table  4  illustrates  the  decline  in  the  iteration  penalty  which  b  normally  associated  with  elliptic 
character  when  the  system  b,  in  fact,  approximately  parabolic  and  the  relaxation  sweeps  are  done 
in  the  proper  direction.  The  problem  (from  [2])  b 


-V*u  +  cUp  =  0 


into  th«  aolver. 

tThe  Cod  and  operation  count  coeffleienta  mre  baaed  on  detailed  timing  teata  on  a  aingle  iPSC  node.  The  Cat  term 
includea  the  coot  of  equating  the  aine  functiona  in  /. 

1  Double  preciaion  ia  perfonned  in  aoftwara. 
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on  the  unit  square  with  the  inhomogeneous  Dirichlet  boundairy  conditions  u(0,y)  =  u(l,y)  =  1, 
u(z,0)  =  1  +  sin(irz),  tt(ar,  1)  =  1  +  2sin()rx),  whose  solution  possesses  a  downstream  boundaury 
layer  whose  resolution  requirements  we  ignore.  The  stripwise  decomposition  is  applied  to  a  uniform 
32  X  32  grid,  using  1,  2,  4,  8,  16,  and  32  processors.  The  initial  iterate  is  u  =  0.  Since  this  is  simply 
amother  scalair  lineatr  problem,  we  present  only  the  iteration  counts  in  Table  4. 

Values  of  the  dimensionless  Reynolds/Peclet  number  c  in  the  jet  configuration  which  motivates 
this  study  are  typically  in  the  range  10^  to  10®.  Therefore  we  consider  both  magnitudes  below. 
(The  first  row  of  T^ble  2  gives  the  c  =  0  limit.)  In  order  to  indicate  the  effect  of  n  upon  L,  we  also 
consider  two  resolutions,  n  =  16  and  n  =  32,  at  each  Reynolds/Peclet  number. 


p=  1 

2 

4 

8 

16 

32 

Ip  (c  =  10®,  n  =  16) 

19 

23 

27 

34 

49 

. 

Ip  (c=  10®,n  =  32) 

42 

45 

50 

59 

75 

107 

Ip  (c=:  10®,  n  =  16) 

8 

11 

13 

18 

27 

- 

Ip  (c  =  10®,n  =  32) 

12 

14 

17 

22 

32 

51 

Table  4:  Intel  Hypercube  data  for  the  strongly  convective 
linear  problem,  -V®u  +  cti,  =  0  in  the  unit  squau^e,  for  differ¬ 
ent  numbers  of  processors,  p.  Ip  is  the  number  of  iterations 
required  for  convergence  with  c  and  n  as  indicated. 

There  are  three  major  trends  in  Table  4.  The  first  is  that  the  number  of  iterations  required 
in  thb  range  of  c  is  roughly  an  order  of  magnitude  less  than  for  c  =  0,  for  the  same  n  and  p.  In 
nonlinear  problems  especially,  this  translates  into  less  communication  per  Jacobian  and  function 
evailuation,  and  thus  into  higher  efficiencies.  One  also  observes  that  doubling  the  resolution  of  the 
grid  generally  less  than  doubles  the  number  of  iterations  for  a  given  c  and  p.  In  contrast,  for  the 
model  Poisson  problem,  doubling  the  resolution  theoretically  quadruples  the  number  of  iterations. 
In  the  other  limit,  c  — »  oo,  of  course,  the  number  of  iterations  is  independent  of  the  resolution 
altogether,  namely  p  itself.  Thirdly,  the  number  of  iterations  may  more  than  double  for  a  given  c 
and  n  as  p  increases  over  its  full  range,  the  c  — »  oo  limit  being  a  case  in  point.  In  this  respect,  the 
Poisson  problem  has  a  milder  dependence  of  L  upon  p.  However,  in  that  problem,  L  is  much  larger 
to  begin  with. 

Of  course,  the  effect  of  convection  on  the  reacting  Sow  operator  is  far  more  complex  than  can 
be  suggested  here,  since  there  is  one  equation  in  the  governing  system  (3.1)  in  which  convection 
plays  no  role  at  all.  In  addition,  convection  is  not  generally  a  co-dominant  term  within  the  reaction 
zone,  so  that  it  is  insufficient  simply  to  march  through  this  zone  a  small  number  of  times. 

Finally,  to  verify  the  h2indling  of  nonlinearities  and  to  illustrate  improved  overadl  efficiency 
with  large  r,  we  consider  the  problem 

r 

-  V®u*  +  c  exp[^  a««,]  =  /*,  A;  =  1, . . . ,  r 
i-t 

on  the  unit  square  with  homogeneous  Dirichlet  boundary  conditions,  where  =  -1-1  for  1  <  A;  and 
au  =  —1  for  I  >  k,  and  where  the  fk  are  the  same  as  in  the  model  problem,  so  that  all  of  the 
Uk  reduce  to  the  solution  thereto  when  c  =  0.  Here  we  take  c  =  4(n  —  1)®  so  that  the  nonlineu 
term  is  co-dominant.  However,  no  time-stepping  is  employed,  and  no  damping  b  triggered  by  thb 
particular  problem  for  the  initial  iterate  u  =  0. 

The  stripwise  decomposition  b  applied  to  a  uniform  32  x  32  grid,  using  16  and  32  processors, 
for  1,  2,  3,  and  4  unknowns  per  gridpoint.  In  order  to  keep  execution  times  under  one  hour,  smadler 
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numbers  of  processors  (and  larger  numbers  of  unknowns)  were  not  used;  therefore  direct  efficiency 
calculations  are  not  possible.  The  results  appear  in  Table  5.  Notice  that  for  a  given  n  and  p,  L  levels 
off  as  r  increases  and  the  intrapoint  couplings  (which  are  handled  directly)  become  more  important 
relative  to  the  interpoint  couplings  (which  are  not).  The  relative  speedups  are  improving  with  r  at 
thb  processor-saturated  end  of  the  scale  due  to  the  increasing  arithmetic-to-ccmmunication  ratio 
It  is  primarily  this  effect  which  grants  the  high  efficiencies  of  the  parallelized  reacting  flow  solver. 


r  =  1 

2 

3 

4 

N 

15* 

17* 

21 

24 

J 

5 

6 

6 

6 

F 

20* 

23* 

27 

30 

Il6 

927 

1531 

2042 

2332 

hi 

1160 

1926 

2561 

2924 

L\t 

61.8 

90.1 

97.2 

97.2 

Lzi 

72.5 

120.4 

121.9 

121.8 

Ti6  (sec.) 

318.2 

957.4 

2109. 

3576. 

T32  (sec.) 

262.7 

704.3 

1470. 

2406. 

T\t/Tz2 

1.21 

1.36 

1.43 

1.49 

<16  (sec.) 

0.34 

0.62 

1.03 

1.53 

*32  (sec.) 

0.23 

0.37 

0.57 

0.82 

*l«/*32 

1.55 

1.68 

1.81 

1.87 

Table  S:  Intel  Hypercube  data  for  the  model  coupled  nonlin¬ 
ear  problem,  -  c  exp(]Cr«i  =  /fc»  for  ^  =  !>•••>»■ 
in  the  unit  square,  with  u)  =  1.0,  for  four  different  numbers  of 
components,  r  =  1,...,4,  and  two  different  numbers  of  pro¬ 
cessors,  p  =  16,32.  N  is  the  number  of  Newton  iterations, 
J  the  number  of  Jacobian  evaluations,  and  F  the  number  of 
function  evaluations.  Ip  is  the  iteration  count  for  the  linear 
relaxation  sweeps,  totaled  over  all  N  Newton  steps,  and  Lp 
is  the  average  per  Newton  step.  Tp  is  the  CPU  time  in  sec¬ 
onds,  and  tp  is  the  per  iteration  CPU  time  in  seconds.  [*  The 
asterisked  figures  are  for  the  16-processor  case.  Due  to  the 
slight  difference  in  size  of  the  terminal  residuals  of  the  linear 
iterations  within  each  Newton  step  between  the  two  different 
processor  configurations,  the  number  of  Newton  steps  and 
function  evaluations  varied  slightly.  In  the  32-processor  case 
N  and  F  were  each  one  greater  for  r  =  1,  and  one  less  for 
r  —  2.  The  non- asterisked  figures  for  IV,  J,  and  F  are  the 
same  for  both  processor  configurations.] 


6.  Conclnslona 

We  conclude  with  a  discussion  of  an  instance  of  the  “inverse  problem”  of  parallel  computing, 
namely,  “Given  problem  X  and  algorithm  Y,  what  kind  of  parallel  computer  should  it  be  executed 
on?”  For  a  generic  two-dimensional  nonlinear  elliptic  system  characterized  by  very  expensive 
function  evaluations,  a  Newton-like  method  with  a  relaxation-based  solver  of  either  block-line  or 
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block-point  type  forms  a  practical  algorithm.  The  block-line  form  of  the  algorithm  parallelizes  well 
on  a  ring  or  hypercube  network  of  processors,  and  efficiently  employs  any  number  of  processors  less 
than  or  equal  to  n,  the  resolution  in  the  coordinate  direction  transverse  to  the  lines,  while  tolerating 
a  relatively  high  message  initiation  to  floating  point  speed  ratio.  Shared  memory  machines  are  also 
useful  for  this  problem,  particularly  in  the  coatse-grmned  limit.  Thb  implies  that  severad  of  the 
currently  avadlable  parallel  clusters  containing  small  numbers  of  powerful  nodes  will  be  of  great 
interest  to  reaM:ting  flow  modelers. 

To  audiieve  acceptable  running  times,  however,  modelers  ultimately  will  not  be  content  with 
speedups  of  0(n).  The  block-point  form  of  the  algorithm  also  lends  itself  to  parallel  implementation, 
and  extends  the  number  of  processors  that  can  effectively  be  put  to  use  beyond  0(n)  into  the  O(n^) 
range  without  amy  necessity  of  decoupling  the  strongly  interrelated  unknowns  defined  at  a  single 
gridpoint.  The  block-point  version  places  more  restrictions  on  the  type  of  interconnect,  however. 
With  realistic  message  overheauls  a  hypercube  is  required  for  high  performance.  No  hypercubes  of 
the  power  and  dimension  required  for  this  algorithm  exbt  today;  however,  their  development  in 
the  neair  future  b  both  likely  amd  to  be  welcomed. 

We  have  considered  only  two-dimensional  detailed  kinetics  modeb.  Parallelbm  b  not  a  crit- 
icad  issue  in  zero-  or  one-dimensional  detailed  kinetics  modeb,  auid  no  projections  au'e  offered  on 
three-dimensional  modeb.  Though  the  analysb  of  thb  paper  could  easily  be  extended  to  three 
dimensions,  practical  three-dimensional  modeb  will  almost  certainly  require  better  methods  for 
the  discretbation  and  the  linear  algebra.  (Particle-based  methods  with  their  own  very  different 
parallelization  considerations  may  ultimately  replace  continuum-based  methods  in  both  two  amd 
three  dimensions.) 

To  borrow  the  now  famous  phrase  of  C.  Moler,  detailed  kinetics  reaicting  flow  modeling  b  an 
example  of  an  “embarrassingly  parallel”  computation.  The  Newton/time-stepping  hybrid  algorithm 
will  run  well  on  fine-grained  versions  of  technologically  feasible  machines.  With  reasonable  spatial 
resolution,  on  the  order  of  2^*^  or  more  processors  may  usefully  be  employed,  although  it  b  likely 
that  better  algorithms  will  be  developed  before  the  present  one  b  implemented  on  hardware  of  thb 
scale.  The  basic  property  of  reacting  flows  leading  to  the  ease  of  parallelization  -  the  large  amount 
of  zero-space-dimensional  arithmetic  required  in  the  evaluation  of  transport  and  reaction  rate  terms 
-  will  only  be  accentuated  following  present  trends  towards  heavier  fueb  2Uid  the  development  of 
progressively  more  complicated  mechanbms. 
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8.  Appendix 

The  Appendix  contains  procedural  level  language  descriptions  of  the  procedures  SOLVER  and 
NEWTON  described  in  section  2. 


Procedure  SOLyER(  F,  x,  flagsoLVER  ) 

xo  X 

flagjACOBiAN  “refresh” 

^  i^time  ^  0)  Then 
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For  A;  =  1  Step  1  Until  rtf, me  Do 
Repent 

X  <  ^k—l 

Call  NEWTON(  F,  x,  ni,-me>  ^*-1,  Ai*,  flagsEWTONi  f  ^0^9 Jacobian  ) 
If  {flagsEWTON  =  “converged”)  Then 

Xk*-  X 

Break-Repeat 

Else 

Aft  ^  A<i/2 

If  (Aft  <  Afmin)  Then 

flagsoLVER  “damped  to  death  in  transient  mode” 

Return 

End-If 

X  *-  Xk-l 

flagjACOBlAN  ^  “refresh” 

End-If 

End-Repeat 

compute  next  time  step,  Aft+i 
If  (Aft+I  >  Afmax)  Then 
Break-For 
End-If 
End-For 

fiagjACOBiAN  *-  “refresh” 

«<ime  0 

EndJtf 

Call  NEWTON(  F,  x,  rtiimei  ^rt-i,  Aft,  flagsEWTONy  Jacobi  an  ) 

If  (  flag  NEWTON  =  “converged”)  Then 

X  X 

flagsoLVER  “converged  to  steady  solution” 

Else 

flagsoLVER  “damped  to  death  in  steady-state  mode” 

End-If 

Return 

End-Procedure 


Procedure  NEWTON  (  F,  x,  name,  Xk-h  Aft,  flagsEWTONi  /^(^gjACOBiAN  ) 

X  <—  X 

m  1 

For  A;  =  1  Step  1  Until  tinewton  Do 
If  {flagjACOBlAN  =  “refresh”)  Then 
compute  new  Jacobian,  J 
flagjACOBiAN  “recycle” 

End-If 

If  (m  =  1)  Then 

Call  F(  X,  ntimcf  ^k-h  Aft,  F  ) 

Ax  •» - J~^F 

compute  initial  damping  parameter,  A 

e-iiAxii 


^sav€  ® 

Repeat 

X  *—  x  +  AAx 

Ax/*o*  < - J-^F(x) 

^look  *-  ||Axjoo*|| 

^  i^look  ^  ^eonverge)  Then 

fioQSEWTOlf  *—  “^convergedf' 

X  *—  X 

Return 

EleeJf  {^look  <0(2^  And  A  =  1)  Then 
m  ^  2 

^»avt  ^  ^ 

Break  Jtepeat 
Else  Jf  ((,00k  <  0  Then 

flagjACOBlAN  *-  “refresh* 

Break-Repeat 

Else 

A  ^  A/2 

If  (A  <  A„,„)  Then 

f^<igsEWTON  “damped  to  death* 
Return 
End-If 

X  ^  X«av« 

EndJf 

EndJlepeat 

Else 

X  ^  X  +  Axiook 

Call  F(  X,  niifitc)  ^k—it  E  ) 

^^took  —J 

^look 

If  ((took  ^  (eonvorge)  Then 

flt^gNEWTON  “converged* 

X  *—  X 

Return 

Else-If  ((,ook  <  Cim(»av€  And  m  <  Umodified)  Then 
m  <—  m  +  1 

Else 

m  1 

fiogJACOBlAN  *-  “refresh* 

End-If 

End-If 

End-For 

f^o^gNEWTON  *—  “failed  to  converge* 

Return 

EndJ*rocedure 
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Figure  1:  A  schematic  of  the  Jacobiao  matrix  of  the  9-point  finite-difference  operator  showing  its 
sparsity  structure,  (a)  upper  left:  a  two-dimensional  5x6  grid  with  the  stencil  corresponding  to 
(t\j)  =s  (3,3)  highlighted,  (b)  upper  right:  gross  sparsity  structure  of  the  Jacobian,  showing  the 
coupling  between  rows,  (c)  lower  left:  an  enlargement  of  the  cross-hatched  block  of  (b),  showing  the 
coupling  between  points  within  a  row.  (d)  lower  right:  an  enlargement  of  the  double-cross-hatched 
block  of  (c),  showing  the  coupling  between  degrees  of  freedom  within  a  point. 
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Figure  2:  Surface  plots  of  the  parallel  efficiency  of  the  relaxation- based  nonlinear  elliptic  solver 
on  the  methane-air  axisymmetric  jet  diffusion  flame  problem  for  stripwise  decompositions.  The 
resolution  of  the  domain  in  each  dimension,  n,  runs  between  2  and  64.  The  number  of  processors 
employed,  p,  runs  between  1  and  n.  (a)  upper  left:  bus-connected  shared  memory  architecture  with 
6  =  1  (a  =  0).  (b)  upper  right:  ring  architecture  with  a  =  100  and  6=1.  (c)  lower  left:  hypercube 
airchitecture  with  a  =  100  and  6=1.  (d)  lower  right:  hypercube  architecture  with  a  =  1000  and 
6=1. 


Figure  S:  Surface  plots  of  the  parallel  efficiency  of  the  relaxation* based  nonlinear  elliptic  solver 
on  the  methane-air  axisymmetric  jet  diffusion  flame  problem  for  boxwise  decompositions.  The 
resolution  of  the  dom^  in  each  dimension,  n,  runs  between  2  and  64.  The  number  of  processors 
employed,  p,  runs  between  1  and  n^.  (a)  upper  left:  bus-connected  shared  memory  architecture 
with  6  =  1  (a  =:  0).  (b)  upper  right:  mesh  architecture  with  a  =  100  and  6=1.  (c)  lower  left: 
hypercube  architecture  with  a  =  100  and  6=1.  (d)  lower  right:  hypercube  architecture  with 
a  =  1000  and  6=1. 
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p 


Flgur*  4:  Theoretical  (curves)  and  actual  (plotted  circles)  parallel  eflSciency  of  the  relaxation-based 
nonlinear  elliptic  solver  on  the  model  problem  for  stripwise  decompositions  on  the  Intel  iPSC-5.  (a) 
left:  surface  plot  of  theoreticaJ  efficiency  over  2  <  n  <  64,  1  <  p  <  n.  (b)  right:  graph  comparing 
theoretical  and  actual  efficiency  vs.  p  at  n  =  16  and  n  =*  32. 
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